Co-designing ab initio electronic structure methods on a RISC-V vector architecture

Ab initio electronic structure applications are among the most widely used in High-Performance Computing (HPC), and the eigenvalue problem is often their main computational bottleneck. This article presents our initial efforts in porting these codes to a RISC-V prototype platform leveraging a wide Vector Processing Unit (VPU). Our software tester is based on a mini-app extracted from the ELPA eigensolver library. Two platforms were tested: Vehave, a user-space emulator of the RISC-V vector extension, and a RISC-V vector architecture implemented on an FPGA. Metrics were extracted from both systems for different vectorisation strategies, ranging from the simplest and most portable one (autovectorisation, assisted by fusing loops in the code) to the most complex one (intrinsics). We observed a progressive reduction in the number of vector instructions, executed instructions and computing cycles across these strategies, which is expected to lead to a substantial speed-up in the calculations. These outcomes are crucial in advancing the porting of computational materials and molecular science codes to (post-)exascale architectures using RISC-V-based technologies fully developed within the EU. Our evaluation also provides valuable feedback for hardware designers, engineers and compiler developers, making this use case pivotal for co-design efforts.


Introduction
Ab initio electronic structure applications are among the most popular and computationally demanding in High-Performance Computing (HPC). In recent years, developers have put substantial effort into modularising codes, moving from rather extensive and sometimes monolithic applications toward more structured software. This new design is intended to be more adaptable to the incorporation of external libraries for critical or computationally expensive parts of the codes. This adaptation will facilitate the evolution of codes toward the (post-)exascale era and allow them to perform efficiently on heterogeneous computing systems [1].
Eigenvalue problems are key in ab initio electronic structure calculations when solving the Schrödinger equation for many-body extended systems. Eigensolvers are often the main computational bottleneck in Density Functional Theory (DFT) calculations, taking over 90% of the total computational time in relatively large systems and limiting the size and complexity of the model in practice. On this ground, the ELPA (Eigenvalue soLvers for Petaflop Applications) library is designed to solve the eigenvalue problem efficiently on CPUs and GPUs across all major HPC platforms [2][3][4]. ELPA is employed by some of the most widely used DFT codes, such as VASP [5], Siesta [6], Quantum Espresso [7], Abinit [8], exciting [9], FHI-aims [10], GPAW [11] and CP2K [12]. The library can be directly integrated within the code or incorporated as part of wider open-source software library solutions, such as ELSI [13] or SIRIUS [14].
Exascale computing represents a significant advancement in HPC, unlocking unprecedented opportunities to transform materials and molecular modeling, allowing more accurate calculations, complex morphologies and exploration of large data sets to discover novel compounds [15]. However, this evolution will come with more heterogeneity in computer architectures, incorporating specialised hardware for specific applications [16]. Therefore, bidirectional efforts in the co-design of hardware and software are required, starting from early prototype systems, to ensure an efficient transition to the post-exascale era.
Our paper describes the initial steps to port the ELPA eigensolver to a prototype platform called EPAC-VEC, powered by a RISC-V core coupled with a wide vector unit. This architecture is part of EPAC, a collection of RISC-V-based accelerators implemented in a test chip fabricated during 2023 at 22 nm within the European Processor Initiative (EPI) project. EPI seeks to reinforce Europe's digital sovereignty by developing high-performance, energy-efficient processors for supercomputers. EPAC-VEC is an extremely relevant architecture for general-purpose HPC since it features a vector unit capable of handling vectors of up to 256 double-precision elements (16,384 bits per vector register) [17], 32 times wider than the registers of current Single Instruction/Multiple Data (SIMD) architectures such as Intel's AVX-512 extension. This long-vector architecture has already been used to accelerate other HPC workloads such as fast Fourier transforms (FFT) [18], sparse matrix-vector multiplication [19], graph processing algorithms [20], and computational fluid dynamics (CFD) [21]. Given the relevance of eigensolvers, optimisations made on ELPA will eventually benefit the wider community and pave the way for the future portability of electronic structure applications to RISC-V architectures. On the other hand, as a cornerstone of the co-design process, the outcomes collected during this pioneering work are also beneficial in guiding the development of future hardware and compilers.

Methodology
Our performance analyses were carried out using the so-called Software Development Vehicles (SDV) [22]. This set of platforms, compilers, and analysis tools allows software developers to run applications on early iterations of the hardware, providing constant feedback to the architecture design and compiler development teams so that the platform design can be improved quickly.
The initial executions were performed on Vehave [22], a user-space emulator for the RISC-V Instruction Set Architecture (ISA) vector extension. Vehave runs on top of commercial RISC-V platforms, intercepting the vector instructions, decoding them, and emulating the vector extension. The emulator relies on LLVM [23] libraries for instruction decoding and generates detailed Paraver [24] trace files containing information about each emulated vector instruction. In addition to the Vehave platform, where vector instructions are emulated, we also used a field-programmable gate array (FPGA)-based emulation of the EPAC-VEC chip [22]. This FPGA is used as a user-defined reconfigurable hardware platform emulating the EPAC-VEC test chip. The compiler, based on LLVM, supports autovectorisation and provides built-ins for vector instructions. A reference for the EPI vector intrinsics is available online [25].
Given the limitation of a single (emulated) computing core, the early-stage development of the compiler and the limited availability of libraries, this work has been performed using a mini-app extracted from ELPA, representing a small fraction of the code that retains the primary performance-intensive section. Our mini-app was inspired by a broader suite, developed in the NOMAD Center of Excellence [26] framework, to drive co-design in ab initio electronic structure calculations. More details on the development and execution of the mini-apps are given in our recent publication [27]. Our code isolates the trans_ev_tridi_to_band subroutine in ELPA (v.2022.05.001), extracted from the two-stage tridiagonalisation [2]. This method is normally preferred for large problems and when most eigenvectors must be computed. The kernel was selected based on its computational cost, its independence from external functions (neither the Basic Linear Algebra Subprograms (BLAS) library nor communications dominate the function), and especially the extensive effort the ELPA developers have made to adapt this kernel to use vector instructions efficiently on different hardware (i.e., SSE, AVX(2/512), SPARC64 SSE, ARM SVE(128/256/512), BlueGene/(P/Q), NVIDIA, AMD and Intel GPUs). All the code developed within this project, the instructions to execute the mini-app on the RISC-V environment and the ELPA checkpoints for matrices of different sizes are accessible from the associated online repositories.

Results and discussion
Here, we describe the iterative steps toward adapting the ELPA mini-app to run efficiently on a long vector architecture. The ELPA kernels were initially converted from Fortran to C. The C version of the code allowed full compatibility with version 0.7.1 of the EPI LLVM compiler and execution on the FPGA platform. After that, the first step was compiling the scalar mini-app on a commercial RISC-V core. This was done to verify the code's compatibility with the RISC-V architecture, that the compiler supports all data structures and code features, and that the instructions are equivalent in both the C and Fortran versions.
Our initial testing on the VPU used the Vehave emulator. While this emulator does not provide computing cycles or execution times, its traces give the number and type of each executed vector instruction, which is valuable insight for studying code regions with vectorisation potential. Based on that information, we have studied, analysed and enabled the vectorisation of ELPA kernels.
The vectorisation was achieved using three different strategies: (i) enabling the compiler's auto-vectorisation capabilities, (ii) helping the compiler by fusing loops, and (iii) vectorising manually using intrinsics. This bottom-up approach offers several possibilities, from the simplest and most portable method to the most complex and time-consuming one for the software engineer. Performance is expected to increase along with the complexity of the solution.
By leveraging the compiler's autovectorisation capabilities on the original version of the ELPA mini-app, the Vehave performance analyses counted a total of 103,784 vector instructions. However, the traces show that most of the work is done with a vector length of 48 double-precision elements. This happens because ELPA was designed to divide its Q matrix into 48-element stripes. This implementation is very efficient for exploiting memory locality. However, long-vector architectures are more tolerant of memory latency [28] and benefit more from a long vector length. Therefore, this striped distribution turned out to be suboptimal for the VPU. By suppressing the Q-stripes, the algorithm starts leveraging the full vector length (256 elements per vector) without the 48-element limitation, and the number of vector instructions is reduced by a factor of 6 compared to the original version (from 103,784 to 17,304 vector instructions). We should note that we have verified that the outcomes from the Fortran and C versions were equivalent at this stage. The number of vector instructions for each code version is presented in Figure 1, where the different instruction types are noted in the caption.
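The effect of the stripe width on the instruction count can be illustrated with a simple strip-mining model. The Python sketch below is only a back-of-the-envelope illustration, not the ELPA kernel: the problem size `N` and the helper `vector_ops_per_pass` are hypothetical, while the 48-element stripe, the 256-element maximum vector length and the two measured totals are taken from the text.

```python
import math

HW_MAX_VL = 256   # maximum vector length in doubles (from the text)
STRIPE = 48       # ELPA stripe width in doubles (from the text)
N = 3072          # hypothetical number of elements swept, for illustration only


def vector_ops_per_pass(n_elements: int, max_vl: int) -> int:
    """Vector instructions needed to sweep n_elements once when each
    instruction can process at most max_vl elements (strip-mining)."""
    return math.ceil(n_elements / max_vl)


# Striped layout: data is handled stripe by stripe, so every vector
# instruction covers at most STRIPE elements even though the hardware
# could process HW_MAX_VL elements at once.
n_stripes = math.ceil(N / STRIPE)
striped_ops = n_stripes * vector_ops_per_pass(STRIPE, HW_MAX_VL)

# Unstriped layout: the same sweep uses the full hardware vector length.
unstriped_ops = vector_ops_per_pass(N, HW_MAX_VL)

model_ratio = striped_ops / unstriped_ops

# Measured totals reported in the text (all vector instruction types).
measured_ratio = 103_784 / 17_304

print(f"model:    {striped_ops} -> {unstriped_ops} ops ({model_ratio:.2f}x)")
print(f"measured: {measured_ratio:.2f}x")
```

In this idealised model the reduction per sweep is bounded by 256/48 ≈ 5.3. The measured factor of about 6 covers all vector instruction types, including ones the model ignores, so the two numbers are not expected to coincide exactly.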
Upon inspection of the compilation output, we still identified several loops in which the compiler is not able to reuse loaded vectors, so memory accesses are replicated in a sub-optimal manner. Therefore, our next strategy consisted of assisting the compiler by identifying loops that traverse the same variables and combining (fusing) them in the code. This strategy reduces our memory accesses (vle calls in Figure 1) from 7,040 to 6,064. With these changes, the total number of vector instructions is reduced by a factor of 6.7 relative to the original version, an additional 11% reduction with respect to the previous version (from 17,304 to 15,405). Moreover, this strategy is expected to improve performance with any vectorising compiler.
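The benefit of fusing loops that traverse the same data can be sketched with a small counting experiment. The example below is a hypothetical scalar analogue written in Python, not code from the mini-app: `CountedVector`, `dots_unfused` and `dots_fused` are illustrative names, and each `load` call stands in for one vector load (vle) of a chunk of `q`.

```python
class CountedVector:
    """List wrapper that counts how many times its elements are loaded."""

    def __init__(self, values):
        self.values = list(values)
        self.loads = 0

    def load(self, i):
        self.loads += 1
        return self.values[i]


def dots_unfused(q, h1, h2):
    """Two separate loops: q is loaded once per loop."""
    x = 0.0
    for i in range(len(h1)):
        x += q.load(i) * h1[i]
    y = 0.0
    for i in range(len(h2)):
        y += q.load(i) * h2[i]
    return x, y


def dots_fused(q, h1, h2):
    """One fused loop: each element of q is loaded once and reused."""
    x = 0.0
    y = 0.0
    for i in range(len(h1)):
        qi = q.load(i)      # single load, reused by both accumulations
        x += qi * h1[i]
        y += qi * h2[i]
    return x, y


h1 = [1.0, 2.0, 3.0, 4.0]
h2 = [0.5, 0.5, 0.5, 0.5]
q_a = CountedVector([2.0, 4.0, 6.0, 8.0])
q_b = CountedVector([2.0, 4.0, 6.0, 8.0])

res_unfused = dots_unfused(q_a, h1, h2)
res_fused = dots_fused(q_b, h1, h2)

print(res_unfused, q_a.loads)
print(res_fused, q_b.loads)
```

Both variants return the same results, but the fused loop halves the loads of `q`. In the real kernel the saving is smaller (7,040 to 6,064 vle instructions), presumably because only some of the loops share operands.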
Our third and most complex method was using intrinsics, which, in some circumstances, can outperform the simple autovectorisation porting approach. The use of intrinsics allows the vectorisation to be extended, creating a pipeline of vector instructions in an outer loop and further reducing the number of memory accesses. This strategy further reduces the vector instructions to 9,499, achieving an improvement of 13.4× compared to the initial version.
However, not all instructions have the same computational cost, so converting vector instruction counts into computing time (or speed-up) may not be straightforward. We observe that our modifications substantially reduce the number of memory accesses (loads (vle) and stores (vse)), which are among the most computationally costly instructions. In fact, almost two thirds of the vle and vse instructions have been suppressed in the version with intrinsics. On the other hand, the combined number of fused multiply-add instructions (vfmadd and vfmacc) remains almost constant. The settings of vector length (vsetvli) are also progressively reduced; however, their overall contribution to the total number of cycles is almost negligible compared to other instructions. In summary, the memory-access and vector-length-setting instructions are consistently reduced while the arithmetic work is preserved, which points to the improved efficiency of the new implementation.
After our preliminary study with Vehave, we used an experimental platform to evaluate the RISC-V VPU, which is composed of an FPGA and an x86 host server used to program and communicate with it. In addition to measuring hardware counters such as the number of vector instructions, running on the FPGA allows one to obtain cycle-accurate time measurements. This metric enables a more straightforward interpretation of how much the code's efficiency was improved by each implementation. The outcomes of our analyses are presented in Figure 2.
In this case, the scalar version exhibits almost a one-to-one ratio between cycles (227,443,896) and instructions (231,667,638). Compared to the scalar version, the autovectorised one reduces the cycles and instructions by a factor of 9.7× and 54.5×, respectively. The efficiency was further improved in the version with fused loops, which reduces cycles by 9.9× and instructions by 96.1× relative to the scalar version, and in the version with intrinsics, which reaches 23.9× and 210×, respectively.
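The reported factors can be turned back into approximate absolute counts to see how the work per instruction changes. The Python sketch below uses only the scalar totals and speed-up factors quoted above; the derived cycle and instruction counts are approximations (the published factors are rounded), and the helper `implied_counts` is a hypothetical name introduced for illustration.

```python
SCALAR_CYCLES = 227_443_896
SCALAR_INSTRUCTIONS = 231_667_638

# (cycle speed-up, instruction reduction) relative to the scalar version,
# as quoted in the text.
SPEEDUPS = {
    "autovectorised": (9.7, 54.5),
    "fused loops": (9.9, 96.1),
    "intrinsics": (23.9, 210.0),
}


def implied_counts(cycle_speedup, instruction_reduction):
    """Approximate cycles, instructions and cycles per instruction implied
    by the rounded factors reported for one code version."""
    cycles = SCALAR_CYCLES / cycle_speedup
    instructions = SCALAR_INSTRUCTIONS / instruction_reduction
    return cycles, instructions, cycles / instructions


scalar_ipc = SCALAR_INSTRUCTIONS / SCALAR_CYCLES
print(f"scalar: {scalar_ipc:.2f} instructions per cycle")

for version, (cyc, ins) in SPEEDUPS.items():
    cycles, instructions, cpi = implied_counts(cyc, ins)
    print(f"{version}: ~{cycles / 1e6:.1f} M cycles, "
          f"~{instructions / 1e6:.2f} M instructions, "
          f"~{cpi:.1f} cycles per instruction")
```

The instruction count falls much faster than the cycle count, so the average number of cycles per instruction grows in the vectorised versions. This is consistent with each long-vector instruction performing far more work than a scalar one, and it is why instruction counts alone cannot be read as speed-up.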

Conclusions
In this short communication, we describe our efforts to inform the implementation of new (post-)exascale HPC systems based on RISC-V. Our porting is centred around the ELPA eigensolver, a library used by many of the most widely used ab initio electronic structure codes. Therefore, our optimisations will eventually benefit the whole community and pave the way for the future portability of electronic structure packages. On the other hand, as a cornerstone of the co-design process, the outcomes collected from our benchmarks also guide the ongoing development of this future HPC hardware and its compilers. Our porting work was done on the RISC-V core with a VPU developed at the Barcelona Supercomputing Center (BSC) within the framework of the European Processor Initiative (EPI) [29]. The most revolutionary element in the design of this chip is the inclusion of a vector unit capable of handling vectors of up to 256 double-precision elements, compared to, for example, the AVX-512 SIMD extension from Intel, which handles up to 8 doubles. Our testing was carried out using the so-called Software Development Vehicles (SDVs), which allowed us to test our software on the most up-to-date version of the hardware, providing continuous feedback to the architects and compiler developers and contributing to the overall improvement of the EPI design.
Our manuscript summarises the iterative steps for improving the performance of a complex HPC library leveraging a RISC-V-based VPU prototype. Our tests used Vehave, a user-space emulator of the RISC-V vector extension, and an FPGA platform, from which the computing cycles and speed-up metrics were obtained. Vectorisation of the kernel was achieved by (i) auto-vectorisation, (ii) fusing similar loops, and (iii) using intrinsics. This progressive approach offers several possibilities, from the most straightforward and portable approaches to the more complicated ones.
The code adaptations support portability to future hardware with adaptable vector sizes, and we also expect improved performance with other compilers. Moreover, the new v1.0 of the V-extension will allow the compilation of codes in both Fortran and C, so future porting of Fortran code will be more straightforward when the updated hardware is available. We should note that the ELPA library has a patterned file that can be used to create specific kernels for new architectures. Therefore, while we focused on a specific kernel, the porting can be replicated throughout the library following an analogous procedure. The experience gained provides practical guidance for other codes and architectures. In addition, our mini-app-based model represents a pragmatic, user-friendly approach to facilitate co-design efforts, cooperatively fine-tuning the software and hardware components. The outcomes of these efforts will contribute significantly to advancing the porting of ab initio computational materials and molecular science codes (one of the most relevant families of applications, and one of those with the most users in the HPC community) to (post-)exascale hardware architectures developed in the EU. Furthermore, this evaluation serves as valuable feedback for hardware designers, system integrators and engineers actively involved in compilers for these systems.

William Dawson
RIKEN Center for Computational Science, Kobe, Japan
In this paper, Torres et al. describe some early experience porting the ELPA eigenvalue solver library to a RISC-V platform featuring a wide vector unit (as part of the European Processor Initiative). They analyze how efficient the compiler-generated code is, and how it can be improved through a sequence of tuning steps. I appreciated the opportunity to read the paper. This is a valuable article for the community, as they are targeting what may be a very relevant platform in the future. The article is cross-disciplinary, yet very approachable, and the main points are easy to understand. Given its forward-looking nature and narrow scope, it seems appropriate for a brief report.
The main weakness of this paper is that the authors don't provide any broad, transferable results from this data. If I were tasked with porting ELPA to a production-ready RISC-V supercomputer (or to any new machine), I would do exactly as the authors did: I would consider what compiler flags might be optimal, I would manually fuse loops, I would write intrinsic vectorization instructions, etc. Thus, this paper only describes the basic state-of-the-art, instead of going one step further and really evaluating the platform and creating new knowledge. I recommend that the authors rewrite the conclusion section, which wastes half a page repeating information that is fresh in the mind of the reader of this short report. Instead, I would like the authors to make broad recommendations from their data, such as: how should developers prepare their code for the prospect of being ported to this platform, what weaknesses exist in the current compiler toolchains for this system, what other codes should be evaluated based on the weaknesses found here, should we expect this platform to perform well in comparison to other platforms for ELPA if optimization is done properly, if a supercomputer is built on this platform and runs realistic workflows will this kernel obtain peak performance, etc. These are just suggestions, not a to-do list; I think the authors will know best which broad conclusions are best justified by the data.
As you note, the authors of ELPA made extensive efforts to specialize for previous platforms, so what's new about the porting experience for this platform? (Or maybe nothing is, and your conclusion is that this platform works great if you use standard techniques.) This kind of analysis will really allow the authors to impact the community by leveraging their experience and expertise.
Two other broad criticisms I have are:
1. It is not clear to me why the authors use both the Vehave emulator and the FPGA implementation. What are the strengths of the emulator in this regard? Why not just use the (cycle accurate) FPGA? Are the systems tested the same in both cases? Maybe a broad takeaway from this paper could be about which tools to use.
2. The conversion of the code from Fortran to C risks making this benchmark less realistic. Maybe when the Fortran compiler for this platform becomes available, the compiler would be able to do these optimizations without any problem. If you try to compile the C version you wrote on other platforms, how well does it perform against the Fortran version? Maybe your C conversion introduced artificial challenges for the compiler. I think some checks about this would make the paper more solid.
In addition to these broad recommendations, I have some other small comments which the authors might consider for improving their paper.
○ In the abstract, just calling Vehave "the user-space Vehave" is unclear; you should write that it's an emulator there.
○ In the introduction, they cite a roadmap article. Since the introduction specifically talks about modular design, they may also consider citing an article describing the CECAM effort [1].
○ When describing the importance of eigenvalue problems, the authors should clarify that ELPA targets the dense eigenvalue problem, to distinguish it from sparse solvers (though ELPA is used as a building block for sparse solvers).
○ The authors specifically say that eigenvalue solvers take "90% of the total computational time in relatively large systems". Since they have a specific number there, they should cite where they got it from.
○ The authors should describe more specifically what they are testing on. "ELPA checkpoints for matrices with different sizes are accessible" - but what size are you using here? Do you use all the data or just one specific problem?
○ I am pretty sure the data you have labeled "auto" is after the Q-stripes are suppressed, but if you could explicitly note this point, I think that'd be clearer.
○ For Fig. 2, it could help to repeat the color scheme of Fig. 1 instead of using two shades of blue. I also liked that in Fig. 1 you had a patch with the colors in the table; you should do that in Fig. 2 as well.
○ In the paper you claim the code is published as GPL, but on the repository it's LGPL (as it should be so that code can be backported to ELPA).
○ Results and discussion paragraph 8: RISC-V VPU, composed of an FPGA, and a => RISC-V VPU, which is composed of an FPGA and a
○ I don't have access to a RISC-V machine, but I tested out your source code in the gitlab repository on an x86 machine. I think the instructions were easy to follow, but I had some questions:
  ○ M_REAL_SIMPLE_F - what optimizations are done here? It's faster than M_REAL_REFERENCE, but slower than M_REAL_SIMPLE.
  ○ What is the difference between GENERIC and SIMPLE? Why is GENERIC slower than simple?

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: I confirm that this potential conflict of interest did not affect my ability to write an objective and unbiased review of the article.
Reviewer Expertise: Electronic structure calculations, high performance computing.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Details
In the systematic optimization methodology, the authors use three strategies to leverage vector instructions and increase computational efficiency, namely, (i) auto-vectorization, (ii) loop fusion, and (iii) manual vectorization with intrinsics. What is important is that performance is evaluated not only with software emulation but also on an FPGA-based hardware platform, showing outstanding improvements in instruction count and execution cycles.
Among the primary indicators of performance improvements are the following:
1. For the ELPA mini-app executed on the FPGA platform, the auto-vectorization reduces instructions and cycles by factors of 54.5x and 9.7x, respectively, compared to the scalar version.

Ahmed Kamaleldin
Technische Universitat Dresden, Dresden, Germany
This article presents a method to port the ELPA eigensolver to the RISC-V-based vector processor architecture (EPAC-VEC). The architecture is part of the EPI project and can handle vectors of up to 256 double-precision elements.
The execution was performed on a user-space emulator called Vehave for the RISC-V V ISA. The proposed approach adapts the ELPA mini-app to run on a long vector architecture.
The evaluation shows that applying those auto-vectorization techniques can reduce the instruction counts by 13.4x and the total number of clock cycles.
Comments:
1- Adding several applications (as use cases) for evaluation is recommended.
2- Evaluating the energy efficiency and achieved computing performance is recommended.
3- A comparison with other HPC vector processors could be useful for evaluation.

University of Trento, Trento, Italy
The document discusses the integration of ab initio electronic structure methods with a RISC-V vector architecture in High-Performance Computing (HPC). Ab initio calculations, particularly in Density Functional Theory (DFT), are computationally intensive, with eigenvalue problems often being the primary bottleneck, consuming up to 90% of total computational time in large systems.
The project focuses on porting the ELPA eigensolver library to a RISC-V prototype featuring a wide Vector Processing Unit (VPU). This VPU can handle vectors of up to 256 double-precision elements, significantly larger than those managed by conventional SIMD architectures like Intel's AVX-512. The approach involves using a mini-app from the ELPA library to optimize performance on the RISC-V platform.

Details
Three optimization strategies were employed to reduce vector instructions and improve computational efficiency: 1) autovectorisation, 2) loop fusion, and 3) manual vectorisation using intrinsics.
Performance was evaluated using both emulation and FPGA platforms, showing significant reductions in execution cycles and instruction counts.
The work is positioned within the broader context of preparing ab initio methods for exascale computing, where increasing architectural heterogeneity is expected. The co-design approach provides critical feedback for the development of future HPC hardware and compilers, influencing the design of post-exascale systems. This research contributes to the EU's strategic goals in HPC, advancing the readiness of key computational codes for next-generation systems.

Key Performance Indicators
Reduction in Vector Instructions.
1. Initial implementation with autovectorization resulted in 103,784 vector instructions. By suppressing the Q-stripes and leveraging the full vector length of 256 elements, vector instructions were reduced to 17,304, a 6x reduction.
2. Further optimization with loop fusion reduced the vector instructions to 15,405, a 6.7x reduction from the original.
3. Manual vectorization using intrinsics reduced the vector instructions to 9,499, achieving a 13.4x improvement over the initial version.

Performance Improvements.
1. The scalar version required 227,443,896 cycles and 231,667,638 instructions.
2. With fused loops, there was a 9.9x reduction in cycles and a 96.1x reduction in instructions.
3. Using intrinsics, the improvements reached 23.9x in cycles and 210x in instructions compared to the scalar version.

Memory Access Reduction:
The optimization strategies led to significant reductions in memory accesses, with a focus on the costly load (vle) and store (vse) instructions. For example, memory accesses (vle calls) were reduced from 7,040 to 6,064 after loop fusion.

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: High-Performance Computing, GPU Computing, Distributed-memory systems.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 1. Count of vector instructions in the different versions of the ELPA mini-app executed on Vehave. Instructions are distributed by total (light blue bars), 64-bit unit-stride load (vle, orange) and store (vse, orange), settings of vector length (vsetvli, gray), and multiplication-additions (vfmadd in dark blue, vfmacc in green). The reduction in vector instructions of the differently adapted versions compared to the striped (original) one (which counted a total of 103,784 vector instructions) is indicated over the first bar. Numbers in the graph and table are expressed in thousands.

Figure 2. Count of cycles (dark blue bars) and instructions (light blue) for the different vectorised versions of the ELPA mini-app executed on the FPGA system. The speed-up of the differently adapted versions compared to the scalar one is indicated over the bars, following the same colour code. The y-axis is presented in a logarithmic scale, and all numbers are expressed in millions. PAPI counters were used for these implementations.

Peer Review Status: Version 1
Reviewer Report 04 September 2024
https://doi.org/10.21956/openreseurope.19799.r43456
© 2024 Dawson W. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

○ Introduction, paragraph 1: easily adapted to incorporate => more adaptable to the incorporation of
○ Introduction, paragraph 1: and to perform efficiently heterogeneous computing systems. => and to perform efficiently "on" heterogeneous computing systems.
○ Introduction, paragraph 2: FHI-aims, GPAW, or CP2K => FHI-aims, GPAW, "and" CP2K.
○ Introduction, paragraph 2: The library can be directly implemented => The library can be directly "integrated"
○ Introduction, paragraph 3: a significant advancement HPC => a significant advancement "in" HPC
○ Introduction, paragraph 4: the relevance of the eigensolver => the relevance of eigensolvers
○ Introduction, paragraph 4: landmark work => pioneering works
○ Methodology, paragraph 2: runs on top of the RISC-V => runs on top of RISC-V
○ Methodology, paragraph 2: This FPGA is used as user-defined => This FPGA is used as a user-defined.
○ Methodology, paragraph 2: this link => this looks weird with the superscript citations, maybe replace "this link" with "online".
○ Methodology, paragraph 3: computationally cost => computational cost.
○ Methodology, paragraph 3: especially to the extensive => especially for the extensive
○ Methodology, paragraph 3: the indications to execute => the instructions to execute
○ Methodology, paragraph 3: associated online repositories. Here a link would be good (I know it comes at the end of the article, but it'd be convenient to have it here too).
○ Results and discussion, paragraph 3: The increasing performance. I don't understand what this sentence means. I also don't understand why you characterize your strategy as bottom-up.
○ Results and discussion, paragraph 4: counted for a total => counted a total.
○ Results and discussion, paragraph 4: without limitations of 48 elements => without the limitation of 48 elements
○ Results and discussion, paragraph 6: either put a comma or a colon after 9,499.
○ Results and discussion, paragraph 7: On the other hand, => I was confused by this part until I realized you meant the sum of vfmadd and vfmacc remains constant, not each one individually.

2. The performance was finally improved in the version with intrinsics, obtaining a speedup of 210x in instructions, and 23.9x in cycles.
The reviewed research demonstrates the massive potential of the co-design approach and RISC-V open architecture for developing future HPC hardware and compilers, which can achieve unprecedented application performance and computational efficiency. This work makes a clear contribution to European strategy in the HPC domain and promotes technologies fully developed within the EU.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and does the work have academic merit? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
://doi.org/10.21956/openreseurope.19799.r42945
This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and does the work have academic merit? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
https://doi.org/10.21956/openreseurope.19799.r42941
© 2024 Vella F. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.